Introduction

This is my report for the final project of the Coding 2: Web scraping course at the Central European University.

In this project I had to do two things:

  1. Download the top 100 tennis player historically
  2. Select the year 2010 and download all the player statistics

I was also advised to use the techniques that I learned during the course: write functions, use lapply, and rbindlist to create dataframes and create a report on the findings.

Preparatory steps

First, I clean the environment and install the required packages.

#clean environment
rm(list =ls())

#loading packages with pacman
if (!require("pacman")) {
  install.packages("pacman")
}

pacman::p_load(rvest, data.table, xml2,kableExtra,ddplot,lubridate, ggrepel, gridExtra, readr, dplyr)

#load ddplot package for racing bar chart
remotes::install_github("feddelegrand7/ddplot", build_vignettes = TRUE)

Ranking table historically

Creating the database

Next, I read in the base html and create a list from the drilldown of dates on the site, which I will use to create all the links to download the historical data.

#read in base website
t <- read_html("https://www.atptour.com/en/rankings/singles")

#creating function to remove unnecessary text
clean_up <- function(x) {
  stringr::str_replace_all(x, "[\r\t\n]", "")
}

#get all the dates for which there is a list of top 100
dates <- t %>% html_nodes(".dropdown-holder-wrapper:nth-child(1) li") %>% html_text() %>% lapply(clean_up)

#first date is twice, and there are two dates for which there is absolutely no data on the site (they are bugs), so we remove these from the dates
dates <- dates[-1] 

dates <- dates[dates != "1985.03.03"]

dates <- dates[dates != "1976.03.01"]

#creating the links
links <- paste0('https://www.atptour.com/en/rankings/singles?rankDate=', dates)

Next, I create a function that gets the ranking for one date.

#write function that gets the top 100 tennis players for one date from atptour.com 
#and returns with a dataframe of the ranking, move, country, player, age, tournaments played, points dropping and next best
get_top100 <- function(t_url) {
  t <- read_html(t_url)
  
  df <- t %>% html_nodes(xpath ='//*[@id="rankingDetailAjaxContainer"]/table') %>%
    html_table()
  
  df <- df[[1]]
  
  df <- df %>% mutate("Country"  = substr(t %>% html_nodes(".country-item") %>% html_nodes("img") %>%
                                                  html_attr("src"), 26, 28) %>% toupper())
  
  df[["Ranking"]] <- as.numeric(df[["Ranking"]])
  
  df[["Points Dropping"]] <- as.numeric(df[["Points Dropping"]])
  
  df[["Move"]] <- as.numeric(df[["Move"]]) # change if NA to 0!!!
  
  temp <- t %>% html_nodes(".move-cell") %>% html_children() %>% html_attrs()
  
  df <- df %>% mutate(Move = case_when(temp[seq(1, length(temp), by = 2)] == "move-down" ~ Move * (-1), TRUE ~ Move))
  
  df [["Date"]] <- t %>% html_node(".dropdown-label") %>% html_text() %>% lapply(clean_up)
  
  df [["Date"]] <- as.Date(gsub(".", "-", df$Date,fixed = TRUE))
  
  df [["Points"]] <- as.numeric(gsub(",", "", df$Points,fixed = TRUE))
  
  names(df) <- c("ranking", "move","country", "player","age","points","tourn_played","points_dropping","next_best","date")
  
  return(df)
}

After we have the function, I use lapply and rbindlist to create the final dataframe.

#apply the function to all dates
list_of_dfs <- lapply(links, get_top100)

#bind the lists together to create the final dataframe
final_df <- rbindlist(list_of_dfs)

#Save the dataframe to a csv
write.csv(final_df,"data/final_df.csv", row.names = FALSE)

The dataframe has 217669 record and its head looks like this:

#read in the dataframe
final_df <- read_csv("data/final_df.csv")

#show its top 5 rows
head(final_df,5) %>%
  kbl() %>%
  kable_styling("basic") %>%
  kable_paper("hover", full_width = T)
ranking move country player age points tourn_played points_dropping next_best date
1 NA SRB Novak Djokovic 34 11540 14 0 0 2021-12-13
2 NA RUS Daniil Medvedev 25 8640 23 0 0 2021-12-13
3 NA GER Alexander Zverev 24 7840 23 0 0 2021-12-13
4 NA GRE Stefanos Tsitsipas 23 6540 26 0 0 2021-12-13
5 NA RUS Andrey Rublev 24 5150 28 0 0 2021-12-13

Visualizations

To get some insights about this vast dataset that I scraped, I create some visuals. For starters, I decided to create a racing bar chart showing the evolution of the number of points by the top 50 tennis players over time (I decided to restrict to top 50 because to see the full 100 requires a large screen). This interactive visual gives the viewer an idea of the evolution of the best tennis players over time. Since the dataframe only contains points from 1996-08-12 onwards, this is the starting date for the racing bar chart.

Having seen this evolution of players over time on the racing bar chart, I got curious about who were the players that were able to climb up the most on this list and who were the ones that somehow achieved a great position on the ranking table but then fell down spectacularly. On this next bar chart, I visualize the top 10 players that moved up the table and the top 10 players that moved down the table. Pat Cash and Vitas Gerulaitis turned out to be the players that improved the most, climbing a total of more than 300 places up over their careers. At the other end of the distribution, Tom Okker, Paolo Bertolucci, Jaime Fillol Sr. and Roscoe Tanner all went in the opposite direction more than 100 places cumulatively.

Now that we have seen the evolution of tennis ranking players and the players that moved the most on the table, I decided that it was time to also see the main point of such tables: I visualized the best players of all time. For this, I calculated the average ranking by player over their careers. As can be seen on the chart the all time great (from 1973 to now) is Bjorn Borg, Roger Federer comes second, while Rafael Nadal managed to seal the third place. Novak Djokovic and Ivan Lendl come in the fourth and fifth places.

Finally, I decided to utilize another aspect of the data, namely that for each player from the flags I was able to get the nationalities of the players. I group the entire dataset by the nationalities and calculate the average ranking by their players. Having created this view of the data, I visualize the top and bottom 5 countries that had at least one player who made it to the top 100 from 1973 to now. What we can see on the chart is that interestingly Greece, Bulgaria and less surprisingly Switzerland have the lowest average scores, while Kenya, Bahrein and Montenegro still have a lot of work to do to become dominant in tennis.

Player statistics for top 100 in 2010

After downloading the entire database historically, I was required to choose one year and download all the player statistics. I chose the ranking table at the end of 2010, as my favorite player Rafael Nadal led the ranking table at the time.

Creating the database

First, I read in the base html with the given date and get all the relative links pointing to the top 100 players in the ranking table. Pasting the relative links with the base, I get the full links to all players of interest.

#read in base website
t2 <- read_html("https://www.atptour.com/en/rankings/singles?rankDate=2010-12-27")

#get the relative links to the players
rel_link <- t2 %>% html_nodes(".player-cell") %>% html_node("a") %>% html_attr("href")

#create the full links
links2 <- paste0("https://www.atptour.com", rel_link)

Now that I have the links to the players, the exercise was to get all the information I can find for the players from these sites. To this end, I created a gigantic function that does just that, from online social media, ranking, nationality and birthday to height and weight, I also scraped the statistics for the careers of the players and for the still active ones their statistics for 2021. I do not show this function in this report (it is simply too long).

With the help of this super function, it takes now only two lines of code to create the second final dataframe contatining detailed information for all tennis players that made into the top 100 ranked at the end of 2010.

#apply the function on the links
list_of_dfs2 <- lapply(links2, get_player)

#bind the lists for all dates in a dataframe
final_df2 <- rbindlist(list_of_dfs2)

#write to csv
write.csv(final_df2,"data/final_df2.csv", row.names = FALSE)

The dataframe can then be read and it looks like this (scroll to see all columns):

name current_rank nationality facebook instagram twitter youtube website birthday age turned_pro weight_kg weight_lbs height height_cm birth_city birth_country handed style coach(es) rank2021 win2021 lose2021 titles2021 prizemoney2021 bestrank bestranktime win lose titles prizemoney
Rafael Nadal 6 ESP https://www.facebook.com/Nadal https://www.instagram.com/rafaelnadal/ https://twitter.com/RafaelNadal NA http://www.rafaelnadal.com 1986-06-03 35 2001 85 187 6’1 185 Manacor Mallorca Left-Handed Two-Handed Backhand Carlos Moya, Francisco Roig 6 24 5 2 1478830 1 2008-08-18 1028 209 88 124961595
Roger Federer 16 SUI https://www.facebook.com/Federer https://www.instagram.com/rogerfederer/ https://twitter.com/rogerfederer NA http://www.rogerfederer.com 1981-08-08 40 1998 85 187 6’1 185 Basel Switzerland Right-Handed One-Handed Backhand Ivan Ljubicic, Severin Luthi 16 9 4 0 647655 1 2004-02-02 1251 275 103 130594339
Novak Djokovic 1 SRB https://www.facebook.com/djokovicofficial/ https://www.instagram.com/djokernole/ https://twitter.com/DjokerNole https://www.youtube.com/djokovicofficial http://novakdjokovic.com 1987-05-22 34 2003 77 170 6’2 188 Belgrade Serbia Right-Handed Two-Handed Backhand Marian Vajda, Goran Ivanisevic 1 55 7 5 9069225 1 2011-07-04 989 199 86 154756726
Andy Murray 134 GBR https://www.facebook.com/andymurrayofficial/ https://www.instagram.com/andymurray/ https://twitter.com/andy_murray NA http://www.andymurray.com 1987-05-15 34 2005 82 181 6’3 191 Glasgow Scotland Right-Handed Two-Handed Backhand Jamie Delgado 134 15 14 0 514248 1 2016-11-07 691 214 46 62314306
Robin Soderling NA SWE NA NA https://twitter.com/RSoderling NA http://robinsoderling.se 1984-08-14 NA 2001 87 192 6’4 193 Tibro Sweden Right-Handed Two-Handed Backhand NA NA NA NA NA NA 4 2010-11-15 310 170 10 10423124

Visualizations

This dataset is a much more detailed one, with plenty of interesting insights to potentially show. I decided to limit these to three especially interesting ones.

For the first, I decided to create a new variable that counts the number of social media platforms that a user has. The role social media platform plays in the value and success of sport players became an interesting topic in sport sciences over the past couple of years and so I was curious of the distribution. Of course the number of followers would probably tell even more, but I still found it interesting to look at how common it is among tennis players to build their brands in various different platforms. As can be seen on the histogram, on average they use 2.31 sites, but some use 5 different ones and others do not have any.

For the second plot, I again was curious about a distribution, but now I wanted to see at what age players reach their top position on the ranking table. For this, I again needed to create a new variable, in which I subtracted the birthdate of the players from the time when they were at their peak in terms of place in the ranking table.

As can be seen on the figure, tennis players generally tend to reach their peak at around 27 years of age, but some interesting extreme values are also worthwhile to note: There are some players who reached their peak at 20, while for some others it took much more time, way into their 30s up to 36 years of age.

Finally, I also wanted to create a plot that shows the relationship between variables in the dataframe. To this end, I visualize a scatter plot with the number of wins the players have had over their career on the x axis and the prizemoney they collected. The points are then colored based on the style they play. What we can see on this chart is not atypical of single player sports. There are some players who are way above the others, both in number of wins and money collected. The notable ones are Roger Federer, who has the most wins and Novak Djokovic, who collected the most money, but also Rafael Nadal and Andy Murray are notable names. Among the rest, there is also a clear positive relationship between the number of wins and prize money collected. Regarding the style, most players play with two0handed backhand, but there are some who lpay one-handed backhand (the most notale being Federer), and for some it is unknown.

Conclusions

To conclude, in this project, I downloaded the top 100 tennis player historically and selected the year 2010 and downloaded also all the player statistics for that year. Using some interesting visuals, I was also able to learn about the data some interesting insights:

I looked at the evolution of the ranking table over time with the help of a racing bar chart. Then, I learned about who were the players that moved the most upwards and downwards on the ranking table throughout their careers (Pat Cash moved most up, while .. moved the most down). Next, I also looked at who are the best players of all time based on average ranking over their careers (Bjorn Borg tops the list). After this, I looked at the best and worst performing countries in tennis (interestingly Greece came on top, while Kenya is at the bottom).

For the detailed player statistics of the 2010 ranking table, I also got some insights through visuals. I learned that players on average have around 2-3 social media sites, and reach their peak performance at around the age of 27. Finally, I also looked at the relationship between the number of wins by players and the prize money they collected. There is a clear positive relationship with some exceptional players having much more wins and also much more prize money.